Four "Corgie" model vehicles were used for the experiment: a double decker bus, Cheverolet van, Saab 9000 and an Opel Manta 400 cars. This particular combination of vehicles was chosen with the expectation that the bus, van and either one of the cars would be readily distinguishable, but it would be more difficult to distinguish between the cars.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from scipy.spatial import distance
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist # Pairwise distances between data points
from scipy.cluster.hierarchy import fcluster
silhouette_df = pd.read_csv("vehicle.csv")
silhouette_df.sample(10)
silhouette_df.shape
silhouette_df.info()
silhouette_df.dropna(inplace=True)
silhouette_df.info()
Since the class variable is categorical, you can use the value_counts function to inspect its distribution
silhouette_df['class'].value_counts()
sns.countplot(x='class', data=silhouette_df)
plt.figure()
pd.Series(silhouette_df['class']).value_counts().sort_index().plot(kind = 'bar')
plt.ylabel("Count")
plt.xlabel("class")
plt.title('Class distribution');
silhouette_df.isnull().any()
silhouette_df.isnull().sum()
sns.pairplot(silhouette_df, diag_kind='kde' , hue = 'class')
silhouette_df.dtypes
# Class feature is an object column , so it should not be standardized
silhouette_df_X = silhouette_df.drop('class',axis=1)
Since the scales of the individual features are not really known to us, it would be wise to standardize the data using z-scores before we apply any clustering method. You can use the zscore function to do this
silhouette_df_z = silhouette_df_X.apply(zscore)
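To confirm the standardization behaved as expected, you can check that every column now has (approximately) zero mean and unit standard deviation. A minimal sketch on a toy frame (the column names here are hypothetical, not taken from the real dataset):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy frame standing in for silhouette_df_X (columns are made up)
toy = pd.DataFrame({"compactness": [90.0, 100.0, 110.0],
                    "circularity": [40.0, 44.0, 48.0]})

toy_z = toy.apply(zscore)  # same call as used on the real data

# zscore uses the population std (ddof=0), so each column should have
# mean ~0 and std ~1 when measured with ddof=0
print(toy_z.mean().round(6).tolist())        # -> [0.0, 0.0]
print(toy_z.std(ddof=0).round(6).tolist())   # -> [1.0, 1.0]
```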
labelencoder=LabelEncoder()
#silhouette_df_z['class'] = labelencoder.fit_transform(silhouette_df['class'])
silhouette_df_z.sample(2)
sns.pairplot(silhouette_df_z, diag_kind='kde')
cluster_range = range(2,10)
cluster_errors = []
distortion = []
for cluster_n in cluster_range:
    clusters = KMeans(n_clusters=cluster_n, n_init=5)
    clusters.fit(silhouette_df_z)
    labels = clusters.labels_
    cluster_errors.append(clusters.inertia_)
    distortion.append(sum(np.min(distance.cdist(silhouette_df_z, clusters.cluster_centers_, 'euclidean'), axis=1)) / silhouette_df_z.shape[0])
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:15]
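The two quantities collected in the loop are closely related: inertia_ is the sum of squared distances from each point to its nearest centroid, while the distortion computed via cdist is the mean (unsquared) distance to the nearest centroid. A tiny hand-checkable sketch:

```python
import numpy as np
from scipy.spatial import distance

# Two 1-D points and a single centroid at x=1
X = np.array([[0.0], [4.0]])
centers = np.array([[1.0]])

# Distance from each point to its nearest centroid
d = np.min(distance.cdist(X, centers, 'euclidean'), axis=1)  # [1.0, 3.0]

inertia = np.sum(d ** 2)             # 1 + 9 = 10.0 (what inertia_ reports)
distortion = np.sum(d) / X.shape[0]  # (1 + 3) / 2 = 2.0
print(inertia, distortion)
```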
# Elbow plot
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )
Use Matplotlib to plot the scree plot. Note: the scree plot shows distortion versus the number of clusters
plt.plot(range(2,10), distortion, 'go-')
Optimal value of k is 3, as per the pair plot and elbow plot (the bend is sharpest at 3)
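Besides the elbow plot, the silhouette score from sklearn.metrics offers an independent check on k; it ranges from -1 to 1 and is higher when clusters are dense and well separated. A sketch on synthetic blobs (the real vehicle data may behave less cleanly, so treat this only as an illustration of the procedure):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs standing in for the scaled data
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [20, 0]],
                  cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=100).fit(X)
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)
print(best_k)  # expected to be 3 for these well-separated blobs
```

On the actual data you would call silhouette_score(silhouette_df_z, cluster.labels_) for each candidate k and compare.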
Note: Since the data has more than 2 dimensions, we cannot visualize it directly. As an alternative, we can observe the centroids and note how they are distributed across the different dimensions
cluster = KMeans( n_clusters = 3, random_state = 100 )
cluster.fit(silhouette_df_z)
silhouette_df_z_copy = silhouette_df_z.copy(deep = True) # Creating a mirror copy for later re-use instead of building repeatedly
You can use the kmeans cluster_centers_ attribute to pull the centroid information from the fitted instance
centroids = cluster.cluster_centers_
centroids
Hint: Use the pd.DataFrame function
centroid_df = pd.DataFrame(centroids, columns = list(silhouette_df_z) )
centroid_df
prediction=cluster.predict(silhouette_df_z)
silhouette_df_z["PREDICTED_CLASS"] = prediction
silhouette_df_z.sample(5)
cluster.labels_
# can also use silhouette_df_z['PREDICTED_CLASS']
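Since the original class labels are available, a cross-tabulation of the true class against the predicted cluster shows how well the 3 clusters line up with the 4 vehicle types (perfect alignment is not expected, since k=3 is smaller than the number of classes). A minimal sketch of the idea on toy labels (the values below are hypothetical):

```python
import pandas as pd

# Toy true labels and cluster assignments (hypothetical values)
true_class = ["bus", "bus", "van", "van", "saab", "opel"]
predicted = [0, 0, 1, 1, 2, 2]

# Rows = true class, columns = predicted cluster; each cell is a count
ct = pd.crosstab(pd.Series(true_class, name="class"),
                 pd.Series(predicted, name="cluster"))
print(ct)
```

On the real data the equivalent call would be pd.crosstab(silhouette_df['class'], silhouette_df_z['PREDICTED_CLASS']).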
For hierarchical clustering, we will create datasets using a multivariate normal distribution so that we can visually observe how the clusters are formed at the end
np.random.seed(101)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
c = np.random.multivariate_normal([10, 20], [[3, 1], [1, 4]], size=[100,])
print(a[:5])
print(b[:5])
print(c[:5])
print(a.shape)
print(b.shape)
print(c.shape)
hier_clustering_data = np.concatenate((a,b,c),axis=0)
hier_clustering_data[:5]
hier_clustering_df = pd.DataFrame( hier_clustering_data,columns=["A","B"] )
hier_clustering_df.head()
hier_clustering_df.shape
sns.pairplot(hier_clustering_df, diag_kind='kde')
plt.scatter(x=hier_clustering_df['A'],y=hier_clustering_df['B'])
Use ward as the linkage method and Euclidean as the distance metric
# The cophenetic correlation coefficient measures the correlation between the pairwise distances of points in feature space and their distances on the dendrogram
# the closer it is to 1, the better the clustering
link_z = linkage(hier_clustering_df,'ward')
c, coph_dists = cophenet(link_z , pdist(hier_clustering_df))
c
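As a sanity check on the cophenetic correlation, two tight, far-apart groups of points should give a coefficient close to 1, since the dendrogram distances then track the original pairwise distances almost perfectly. A small sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Two tight pairs of points, far apart from each other
pts = np.array([[0.0, 0.0], [0.0, 1.0],
                [10.0, 10.0], [10.0, 11.0]])

z = linkage(pts, 'ward')
coeff, _ = cophenet(z, pdist(pts))
print(round(coeff, 3))  # close to 1 for well-separated groups
```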
plt.figure(figsize=(8, 8))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(link_z, leaf_rotation=90.,color_threshold = 50, leaf_font_size=8. )
plt.tight_layout()
Hint: Use the truncate_mode='lastp' argument of the dendrogram function to arrive at a truncated dendrogram
dendrogram(
    link_z,
    color_threshold=81,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=3,                    # number of merged clusters to show
)
plt.show()
# From the truncated dendrogram the optimal distance between clusters is 80
labels = fcluster(link_z,t=80,criterion='distance')
labels
plt.scatter(x=hier_clustering_df['A'],y=hier_clustering_df['B'],c=labels)
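The distance criterion used above can be sanity-checked on a tiny example: cutting a dendrogram below its final merge height must yield exactly two flat clusters, one per group.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points, far apart from each other
pts = np.array([[0.0, 0.0], [0.0, 1.0],
                [10.0, 10.0], [10.0, 11.0]])

z = linkage(pts, 'ward')

# The two within-pair merges happen at height 1; the final merge at ~20.
# Cutting at t=5 therefore separates the two pairs.
flat = fcluster(z, t=5, criterion='distance')
print(len(set(flat)))  # 2
```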